problem was in the item selection procedures. This second problem concerned
stepsize, points of entry into the item pools, and information
cutoff levels. The objective of the current study was to compare the 
one- and three-parameter logistic models using the improved procedures. 
A total of 88 students enrolled in an introductory measurement course 
at the University of Missouri-Columbia served as examinees for the study.
A counterbalanced test-retest design was employed, in which there were 
two separate test sessions one week apart for each examinee. Comparisons 
were based upon (a) test-retest reliability, (b) ability estimates yielded 
by the procedures, (c) the information yielded by the procedures, (d)
the number of items the methods administered, (e) goodness of fit of the 
models based on mean square deviations, and (f) the correlations of esti- 
mated true scores, based on ability estimates, with an outside criterion. 
In addition, an attitude survey was administered after each test session 
to determine student attitudes toward the tailored tests. The results 
of the study indicated that both tailored tests had higher reliabilities 
than a conventional paper-and-pencil test over the same material. The
three-parameter procedure had higher test information than the one-parameter 
procedure and the conventional test. Neither procedure yielded satisfactory 
content validity. The attitude survey results indicated generally favorable
student attitudes toward tailored testing.

A Successful Application of Latent Trait Theory
to Tailored Achievement Testing



Tailored testing has been proposed as an alternative measurement 
technique because of its potential for dealing with some of the major 
problems of conventional testing. Conventional testing, in which the 
same test items are given to all examinees, often results in test items 
of inappropriate difficulty being administered to many examinees. If 
test items are too difficult, an examinee may resort to random guess- 
ing or even omission of items, and if the items are not difficult enough, 
the test may not be challenging to the examinee. As a result, the stan- 
dard error of measurement for conventional tests usually is higher at 
the extremes of the ability range, resulting in tests that are most accur- 
ate for examinees of average ability. This restricted range of accuracy 
is reflected in lowered test reliabilities. 

Other problems, such as time limit pressures and the effects of test 
administration differences (Weiss, 1974), may also affect the precision 
of measurement of conventional tests. In order to deal with these prob- 
lems, tailored testing procedures were developed (Lord, 1970). The purpose 
of this report is to describe a successful application of tailored test- 
ing procedures to achievement measurement. First, however, it may be 
helpful to discuss both the rationale and primary characteristics of 
tailored testing, and earlier attempts at its utilization. 

Tailored testing procedures were designed to reduce the errors of 
measurement when estimating an examinee's ability or level of achieve- 
ment by attempting to administer to each examinee only items of appro- 
priate difficulty. This is accomplished by selecting for administration 
items that maximize the information about an examinee's estimated ability 
level. That is, each examinee receives a test which is "tailored" to 
his ability level. This tailoring hopefully results in increased precision 
of measurement. 

The implementation of tailored testing procedures usually requires 
computer capabilities. One reason a computer is needed is that tailored 
testing is often based on item characteristic curve (ICC) theory (Lord, 
1952; Lord and Novick, 1968). ICC theory involves mathematical models 
of sufficient sophistication as to require the use of a computer for parameter
estimation. One of the first requirements for tailored testing is
a precalibrated pool of items from which test items can be selected for 
administration. The calibration of the item pool is usually accomplished 
by using one of several existing calibration programs (Wright and 
Panchapakesan, 1969; Wood, Wingersky, and Lord, 1976; and Urry, 1975) 
on conventional test item response data in order to obtain item parameter 
estimates for the one-parameter or three-parameter models. 

Another step which requires computer capabilities is the operation
of the tailored testing procedures on an interactive basis with the examinee.
This tailored testing program is controlled by a number of program parameters,
such as the point of entry into the item pool, the procedure for
estimating ability (usually either a Bayesian or maximum likelihood technique),
the item selection method, and a rule for terminating the test.

Once the item pool has been created and the procedures implemented, 
there are several problems that may arise. Among these is a possible 
lowering of the quality of item calibrations when it is necessary to 
link small sample calibrations of several tests in order to create a 
sufficiently large item pool. Another problem is the nonconvergence of 
the ability estimation procedure, and a third stems from possible viola- 
tions of the assumptions of the latent trait models. This last case may 
occur when an extension is made from ability testing to the measurement 
of achievement. In the research reported here an attempt to solve these 
problems will be presented. 

There are a number of models available for use in tailored testing, 
most of which belong to a class of models referred to as latent trait 
models. Within this class are a number of ICC models, also known as Item 
Response Theory (IRT) models. The particular models chosen for this study 
are described below. 



Latent Trait Models 



The Rasch (1960), or one-parameter logistic (1PL) model, as described
by Wright (1977), requires one ability parameter, $\theta_j$, for each examinee,
and one difficulty parameter, $b_i$, for each item in order to describe the
interaction of an examinee and an item. In exponential form the 1PL
model is given by

$$P(u_{ij}) = \frac{\exp[u_{ij}(\theta_j - b_i)]}{1 + \exp(\theta_j - b_i)} \qquad (1)$$

where $u_{ij}$ is the score (0 or 1) on Item $i$ for Examinee $j$, $\theta_j$ and $b_i$ are
as defined above, and $P(u_{ij})$ is the probability that $u_{ij}$ is 0 or 1.

The three-parameter logistic (3PL) model as presented by Birnbaum
(1968) requires three parameters for each item. As in the 1PL model,
the 3PL model requires one ability parameter for each examinee. The 3PL
model is given by

$$P_i(\theta_j) = P(u_{ij} = 1) = c_i + (1 - c_i)\,\frac{\exp[D a_i(\theta_j - b_i)]}{1 + \exp[D a_i(\theta_j - b_i)]} \qquad (2)$$

where $\theta_j$ and $b_i$ are as defined above, $a_i$ is the item discrimination parameter,
$c_i$ is the item guessing parameter, and $D$ is a scaling constant equal
to 1.7.
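
The two response functions can be sketched in a few lines of Python. This is only an illustration of Equations 1 and 2 for a correct response (u = 1); the function names and the example item parameters are hypothetical and do not come from the calibration programs used in the study.

```python
import math

D = 1.7  # scaling constant given in the text for the 3PL model


def logistic(x):
    """Logistic function, written as 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def p_1pl(theta, b):
    """Probability of a correct response under the 1PL model (Equation 1, u = 1)."""
    return logistic(theta - b)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (Equation 2)."""
    return c + (1.0 - c) * logistic(D * a * (theta - b))


# Hypothetical item: a = 1.0, b = 0.0, c = 0.2, examinee at theta = 0.0.
print(p_1pl(0.0, 0.0))            # one half when ability equals difficulty
print(p_3pl(0.0, 1.0, 0.0, 0.2))  # approximately c + (1 - c) / 2
```

When ability equals difficulty, the 1PL probability is one half, while the 3PL probability is raised above one half by the guessing parameter.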
Both these models assume that the items are dichotomously scored,
and that local independence holds. Also, the assumption is made that
the latent trait being measured is unidimensional. (For a full discussion
of the assumptions of these models see Lord and Novick, 1968.) Of particular
significance is the assumption of unidimensionality. When applying
factor analytic methods to ability tests, generally one dominant factor
is found. But achievement tests are usually constructed with a goal of
multidimensional measurement. This multidimensionality requires the
serious consideration of the robustness of the models to the violation
of the unidimensionality assumption when latent trait models are applied
to achievement testing. Before making this examination it will be helpful
to summarize the results of a previous study that used a similar
tailored testing methodology and that demonstrated that tailored testing
procedures could be successfully applied to a unidimensional vocabulary
test (Koch and Reckase, 1978).



Vocabulary Tailored Testing Study 



The purpose of the vocabulary study was to compare the 1PL and 3PL
models in a tailored testing application to vocabulary ability measurement.
The calibration programs used were the MAX program (Wright and
Panchapakesan, 1969) for the 1PL model and the LOGIST program (Wood,
Wingersky, and Lord, 1976) for the 3PL model. Items were selected to
maximize the information function (Birnbaum, 1968) for the maximum likelihood
ability estimate.

The results of this study indicated that, while there were some
problems, either of the two models could be successfully applied to vocabulary
ability measurement. In particular, the reliabilities reported
(a combination of test-retest and equivalent forms reliabilities) were
r = .77 for the 3PL procedure and r = .61 for the 1PL procedure. In
terms of information, the 3PL procedure outperformed the 1PL procedure,
and, in the ability estimate levels between -2.0 and +.50, the 3PL procedure
actually yielded greater information than the longer traditional
paper-and-pencil test.

One of the problems encountered in this study was the failure of 
the 3PL procedure to converge to ability estimates in nearly one-third 
of the cases. When these cases were included in the analyses the 3PL 
reliability dropped to r = .36. The hypothesis was put forward that 
the cases of nonconvergence occurred because the items in the item pool 
were too difficult for many of the examinees. 



Tailored Achievement Testing 



The vocabulary test in the above study was, of course, an ability
test, and relatively unidimensional (the first factor accounted for 41%
of the variance). The measurement of achievement presents quite a different
problem. The multidimensionality of achievement tests raises the
question of the robustness of ICC theory with respect to the violation
of the unidimensionality assumption.

Very little has been published in the literature dealing with appli- 
cations of tailored testing to achievement measurement. In one study 
conducted by Bejar, Weiss, and Kingsbury (1977), a biology achievement
test was used, but that test was found to have a very dominant first factor.
Not surprisingly, the calibration of the item pool with the ICC model proved
adequate. The use of the ICC model on a one-factor achievement test would
not be expected to differ much from use on a unidimensional ability test.

Research reported by Brown and Weiss (1977), in which a tailored
testing procedure was used for an achievement test having several content
areas, indicated that utilizing inter-subtest branching can provide precision
of measurement equal to that of the conventional achievement test.
However, in this study each content area was calibrated separately, rather
than together as a multidimensional item pool. Therefore, even though 
tailored testing procedures were applied to a multidimensional achieve- 
ment test, the issue of the robustness of the ICC model with respect to 
violation of the assumption of unidimensionality was not addressed. 

The issue was addressed, however, in a study reported by Koch and
Reckase (1979). In this study achievement tests were not calibrated by
content area, but rather each test was calibrated as a whole. The achievement
tests used were classroom tests from an undergraduate course in educational
measurement. The tests were each calibrated using both the MAX
program (Wright and Panchapakesan, 1969) and the LOGIST program (Wood,
Wingersky, and Lord, 1976), yielding for each test 1PL and 3PL item parameter
estimates. All the tests had items in common, so item calibration
linking was performed using the Least Squares Method (Reckase, 1979) in
order to form a large item pool for tailored testing. Then a counterbalanced
test-retest design was employed, with each examinee taking both
1PL and 3PL tests in each of two sessions. For both the 1PL and 3PL procedures,
items were selected for administration to maximize the value of
the information function (Birnbaum, 1968).

The results of this study indicated a number of problem areas in 
applying tailored testing to multidimensional achievement testing. Both 
procedures appeared to be inadequate with regard to reliability, with
r = .44 for the 1PL test and r = 0.0 for the 3PL test. In neither case
did test information equal the information yielded by the paper-and-pencil
test, although the 3PL test came substantially closer than did the 1PL
test. Moreover, while the item pool accurately reflected the weighting
of the content areas in the paper-and-pencil exam, the items actually 
selected by the two procedures showed significant deviation from the con- 
tent distribution of both the item pool and the course exam. It should 
be noted here that no branching among content areas was attempted. The 
purpose was to see if selecting items on the basis of information alone 
would approximate the content area weightings of the item pool. 

One other problem that was encountered was nonconvergence of the 
3PL maximum likelihood ability estimation in about eight percent of the 
cases. Recall that it occurred in almost one-third of the cases in the 






vocabulary study previously discussed. The substantial reduction in 
nonconvergence cases was attributed to the use of an item pool of more 
appropriate difficulty in the achievement testing study. 

A number of possible explanations were suggested for the inadequate 
performance of the 1PL and 3PL procedures. Among these were unstable
item parameter estimates due to small sample sizes, a compounding of that
instability due to the linking procedures, poor selection of entry points
into the item pool, the possibility that latent trait models may not be
robust with respect to the violation of the assumption of unidimensionality,
and the nonconvergence of the 3PL tailored tests when using maximum
likelihood ability estimation.

It is clear from looking at this study that, when applying tailored 
testing to achievement measurement, careful attention must be paid to 
the operational characteristics of the procedures. In order to investi- 
gate the robustness of the ICC model with respect to violation of the 
unidimensionality assumption, it is first necessary to eliminate problems 
such as unstable item calibrations, poor linking procedures, and less 
than optimal operational characteristics. The present study is an attempt 
to do just that. 



Method 



Item Pool Construction 

Calibration. The test items that were calibrated for use in the item
pool were obtained from a series of classroom achievement tests administered
in an undergraduate course on educational measurement and evaluation.
Items were taken from six different tests of fifty items each,
covering the content area of educational evaluation techniques. The tests
were calibrated using both the MAX program (Wright and Panchapakesan,
1969), and the LOGIST program (Wood, Wingersky, and Lord, 1976), which
yielded the 1PL and 3PL item parameter estimates, respectively. Sample
sizes ranged from 148 examinees to 316 examinees. The dates of test administration
and sample sizes are presented in Table A-1 of Appendix A.

Linking. It would be quite desirable to have a large sample of perhaps
1000 examinees to which a single test of 150 items or more could
be administered. This would obviate the need for linking and would pro- 
vide more stable item parameter estimates. Unfortunately, it is not often 
possible to administer a test to as many as 1000 examinees at one time. 
Moreover, for security purposes it is usually necessary to alter a test 
between administrations, although there may be numerous items in common
from one administration of a test to the next. Because of this, it is 
generally necessary to link together a series of small sample calibrations 
to get all the item parameter estimates on the same scale. The linking 
is necessary because the item parameter estimates yielded by the latent 
trait calibration programs are only invariant to within a linear trans- 
formation due to the arbitrary nature of the zero point and the unit of 
measurement defined by the separate calibrations (Reckase, 1979). 






The linking of the 1PL "b" values (item difficulty parameter estimates)
was accomplished using the Major Axis Method (Reckase, 1979).
Items in common to the tests to be linked were identified, and for each 
test a mean difficulty value was computed for those items in common. 
One of the tests was arbitrarily designated as the calibration base, and 
a second test calibration was linked to it by adding to each item's b- 
value in the second test a scaling constant equal to the difference between 
the mean difficulty values that were computed on the common items. The
adding of the constant to the second test difficulty values put them on 
the same scale as the calibration base items. At this point the "b" values 
for the common items were combined across these two tests using a weighted 
average procedure based on the sample sizes of the respective calibrations. 
This same procedure was repeated for all of the remaining tests to be 
linked using as a calibration base the composite of previously linked 
tests. 
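
The linking steps described above (shift by the difference in common-item means, then a sample-size weighted average for the common items) can be sketched as follows. This is a minimal illustration, not the program used in the study; the function name, the dictionary layout, and the example values are assumptions, and the weight for a composite base is simply whatever sample size the caller supplies.

```python
def link_1pl(base_b, new_b, n_base, n_new):
    """Put new_b on the scale of base_b and merge the two calibrations.

    base_b, new_b: dicts mapping item id -> 1PL difficulty estimate.
    n_base, n_new: calibration sample sizes, used as averaging weights.
    Returns the merged difficulties and the scaling constant that was added.
    """
    common = sorted(set(base_b) & set(new_b))
    if not common:
        raise ValueError("the two calibrations share no items")
    mean_base = sum(base_b[i] for i in common) / len(common)
    mean_new = sum(new_b[i] for i in common) / len(common)
    shift = mean_base - mean_new  # constant added to every new-test b value
    linked = dict(base_b)
    for item, b in new_b.items():
        b_shifted = b + shift
        if item in base_b:
            # Common item: weighted average based on the sample sizes.
            linked[item] = (n_base * base_b[item] + n_new * b_shifted) / (n_base + n_new)
        else:
            linked[item] = b_shifted
    return linked, shift


# Hypothetical example with one common item ("q2").
linked, shift = link_1pl({"q1": 0.0, "q2": 1.0}, {"q2": 0.5, "q3": -0.5}, 200, 100)
print(shift, linked)
```

Repeating the call with the merged result as the new base mirrors the chaining over the remaining tests.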

The linking of the 3PL calibrations was done using the Maximum Like- 
lihood Method. This procedure is more fully described by Reckase (1979), 
and a brief summary here will suffice. This method required the use of
the LOGIST program in order to simultaneously calibrate the tests. The
test data were first edited into a single large matrix. Items appearing
on Test 1 but not on Test 2 were coded as not reached for Test 2, and
in this way were not used for the calibration of Test 2. The items in
common to the tests ensured that the calibrations were all on the same
scale. The full matrix of responses and not reached codes was analyzed
to obtain the "a", "b", and "c" parameter estimates.
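
The editing of several tests into one matrix might look like the sketch below. The actual input coding expected by LOGIST is not described in the text, so the `NOT_REACHED` marker, the function name, and the data layout are all hypothetical; the sketch shows only the idea of padding items an examinee never saw with a not reached code.

```python
NOT_REACHED = None  # hypothetical marker; the real LOGIST code is not given in the text


def combine_tests(tests):
    """Merge several tests into one examinee-by-item response matrix.

    tests: list of (item_ids, responses) pairs, where responses is a list of
    rows of 0/1 scores in the same order as item_ids.
    """
    all_items = []
    for item_ids, _ in tests:
        for item in item_ids:
            if item not in all_items:
                all_items.append(item)
    matrix = []
    for item_ids, responses in tests:
        column = {item: k for k, item in enumerate(item_ids)}
        for row in responses:
            # Items that were not on this examinee's test are coded as not reached.
            matrix.append([row[column[item]] if item in column else NOT_REACHED
                           for item in all_items])
    return all_items, matrix


# Two hypothetical tests sharing item "q2".
items, matrix = combine_tests([(["q1", "q2"], [[1, 0]]), (["q2", "q3"], [[1, 1]])])
print(items)
print(matrix)
```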

Item Pool Characteristics. The 1PL and 3PL test procedures used
identical pools of 183 items. Table 1 summarizes the means, standard
deviations, and ranges of the parameter estimates. The correlation
between the respective "b" values was r = .902. Note that the means
and standard deviations of the "b" values for the two calibration pro- 
cedures are not directly comparable because the origin and unit of measure- 
ment set by the two calibration programs are not the same. 

The distributions of the item parameter estimates are shown in Figures 
1-A, 1-B, 1-C, and 1-D. Probably the most disturbing aspect of these 
distributions is the positive skewness of the 3PL discrimination values. 
Approximately 75 percent of the items had discrimination values below 
.75. Figure 1-B shows that the 3PL difficulty values were also positively 
skewed. The 3PL item pool did not meet all of the guidelines for item 
pools as set out by Urry (1977). These guidelines include: item discrim- 
ination values should be over .8; item difficulty values should be evenly 
and widely distributed from about -2.0 to +2.0; guessing values should 
be less than .3; and there should be at least 100 items in the pool. 
The 1PL difficulty values (shown in Figure 1-D) were much more uniformly
distributed. 

Figures 2 and 3 show the information curves for the 1PL and 3PL item
pools, respectively. Again, the 3PL curve is positively skewed, with the
most information being yielded at the lower range of the ability scale.
The 1PL item pool information plot shows a considerably more uniform
curve.

Table 1

Descriptive Statistics of Item Parameter
Estimates for Tailored Testing Item Pools

                One-Parameter        Three-Parameter
                 Calibration           Calibration
Statistic           b_i          a_i        b_i         c_i
Mean              -0.030         .610     -1.674        .180
Median            -0.074         .485     -1.764        .180
S.D.               1.396         .484      3.361        .010
Skewness          -0.284        1.517      1.406      -2.536
Low Value         -5.279         .010     -9.999(a)     .101
High Value         3.052        2.00(b)   14.834        .244

Note. Both pools contained 183 items.

(a) This value was an arbitrary lower limit on the 3PL difficulty parameter
estimates.

(b) This value is an upper limit set by the LOGIST program.

Tailored Testing Procedures 

The procedures actually used for the tailored testing sessions have 
been thoroughly described elsewhere (Koch and Reckase, 1978, 1979; Patience, 
1977), and so only a brief summary is given here.

Tailored testing procedures have three main components: an item 
selection routine, an ability estimation technique, and a stopping rule. 
In this study both the 1PL and the 3PL procedures selected items to maximize
the value of the information function (Birnbaum, 1968) at the most
recent ability estimate. For the 1PL testing procedure the formula for
item information is given by

$$I_i(\theta_j) = \frac{\exp[-(\theta_j - b_i)]}{\{1 + \exp[-(\theta_j - b_i)]\}^2} = \psi(\theta_j - b_i) \qquad (3)$$

where $I_i(\theta_j)$ is the information for Item $i$ at Ability Level $\theta_j$ for Examinee
$j$, $\theta_j$ and $b_i$ are as previously defined, and $\psi(x)$ is the logistic
probability density function. For the 3PL testing procedure the formula
for item information is given by



$$I_i(\theta_j) = D^2 a_i^2\,\psi[DL_i(\theta_j)] - D^2 a_i^2\,P_i(\theta_j)\,\psi[DL_i(\theta_j) - \log c_i] \qquad (4)$$

where $I_i(\theta_j)$ is the value of the item information at $\theta_j$, $L_i(\theta_j) = a_i(\theta_j - b_i)$,
$P_i(\theta_j)$ is the probability of a correct response to Item $i$ given Ability
$\theta_j$, and $\psi(x)$ and the other parameters are as defined earlier. The total
test information was defined by Birnbaum (1968) as the sum of the item
information values:

$$I(\theta_j) = \sum_i I_i(\theta_j) \qquad (5)$$

[Figure 2: information curve for the 1PL item pool]

[Figure 3: information curve for the 3PL item pool]

These formulas were used in the tailored testing procedure to compute
the information for each item at the examinee's current ability estimate.
The item with the greatest information at that ability estimate was then
administered to the examinee, with the provision that the information
must be greater than .246 for the 1PL procedure and .65 for the 3PL procedure.
These values were chosen based on other research, since they
minimize errors in estimation. The information cutoffs were different
for the two procedures because the ability scales for the two models are
different. If no item was available with an information value above these
minimums, testing was terminated.
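
The item selection rule can be illustrated with a short Python sketch. It implements the 1PL and 3PL item information formulas given above and picks the unused item with the greatest information, returning nothing when no item exceeds the cutoff. The function names, the tuple layout of the pool, and the example difficulties are assumptions made for illustration; the 3PL function requires a guessing parameter greater than zero because of the logarithm.

```python
import math

D = 1.7  # scaling constant from the 3PL model


def psi(x):
    """Logistic probability density function."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2


def info_1pl(theta, b):
    """1PL item information."""
    return psi(theta - b)


def info_3pl(theta, a, b, c):
    """3PL item information in the two-term form given in the text (requires c > 0)."""
    x = D * a * (theta - b)
    p = c + (1.0 - c) / (1.0 + math.exp(-x))
    return D ** 2 * a ** 2 * (psi(x) - p * psi(x - math.log(c)))


def select_item(theta, pool, used, info, cutoff):
    """Return the index of the unused item with the most information at theta,
    or None when no unused item exceeds the cutoff (testing then stops)."""
    best, best_info = None, cutoff
    for k, item in enumerate(pool):
        if k in used:
            continue
        value = info(theta, *item)
        if value > best_info:
            best, best_info = k, value
    return best


# Hypothetical 1PL pool of difficulties, using the .246 cutoff from the text.
pool = [(-0.5,), (0.1,), (2.0,)]
print(select_item(0.0, pool, set(), info_1pl, 0.246))  # index of the b = 0.1 item
print(select_item(0.0, pool, {1}, info_1pl, 0.246))    # None: remaining items fall below the cutoff
```

Because 1PL item information cannot exceed .25, a cutoff of .246 admits only items whose difficulty lies within roughly a quarter of a logit of the current ability estimate.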

Before testing began no ability estimates were available for the
examinees, so initial estimates were assigned to set the starting points
in the item pool. The initial ability estimates for this study were set
by random assignment to be either -1.856 or -1.500 for the 3PL test, and
to be either -.494 or .496 for the 1PL test. These values represent
difficulty values near the medians of the item pool difficulty distributions
with one on either side of the median. Two different points
were used in order to provide different initial items from one session 
to the next. The first item was then selected to maximize information 
at the initial ability estimate. If that item were correctly answered 
the ability estimate was increased by a fixed stepsize, and if it were 
incorrectly answered the ability estimate was decreased by a fixed stepsize.
This fixed stepsize procedure was used until a maximum likelihood
ability estimate, the mode of the likelihood distribution, could be
obtained (i.e., when both correct and incorrect responses were obtained).
The stepsize used for the 1PL procedure was .693, and for the 3PL procedure
it was .4. Each new item was selected to maximize the information
at the new ability estimate, with the restriction that no item could be 
used more than once. 
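
The ability updating rule for the 1PL procedure can be sketched as below. The fixed stepsize branch follows the description above; the maximum likelihood branch uses Newton-Raphson iteration, which is an assumption, since the text does not say which numerical method the testing program used. The function name and arguments are hypothetical, and at least one response is assumed.

```python
import math


def update_ability(theta, responses, difficulties, stepsize=0.693):
    """Return the next ability estimate for a 1PL tailored test.

    responses: 0/1 scores so far (at least one); difficulties: matching b values.
    Until both a correct and an incorrect response exist, move by a fixed
    stepsize; afterwards return the maximum likelihood estimate.
    """
    if all(u == 1 for u in responses):
        return theta + stepsize
    if all(u == 0 for u in responses):
        return theta - stepsize
    est = theta
    for _ in range(50):  # Newton-Raphson on the log-likelihood (assumed method)
        p = [1.0 / (1.0 + math.exp(-(est - b))) for b in difficulties]
        gradient = sum(u - pi for u, pi in zip(responses, p))
        hessian = -sum(pi * (1.0 - pi) for pi in p)
        step = gradient / hessian
        est -= step
        if abs(step) < 1e-6:
            break
    return est


print(update_ability(0.0, [1], [0.0]))            # one correct answer: step up by .693
print(update_ability(0.693, [1, 0], [0.0, 0.0]))  # mixed answers: ML estimate near 0
```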

Two stopping rules were used for the testing procedures. The tests
were terminated when there were no items left in the item pool with in- 
formation at the current ability estimate greater than the minimum specified 
above, or when 20 items had been administered. 



Design 

This study employed a counterbalanced design using two sessions
one week apart. Each session included both a test based on the 1PL model
and a test based on the 3PL model. Counterbalancing was achieved by reversal
of the order of presentation of the two tests from one session to the
next. The test-retest design was used to facilitate reliability comparisons.

During the sessions the tests were administered with no perceptible 
break between them. The second test was begun immediately after the final
ability estimate for the first test was obtained. Since both item pools 
contained the same items, some of the items in the first test were repeated 
in the second test. Therefore, examinees were told that they might receive 
the same item more than once. The tailored tests were administered on 
Applied Digital Data Systems (ADDS) Consul 980 cathode ray tube terminals 
connected to an Amdahl 470/V7 computer via time sharing option facilities. 



Sample 

Examinees were volunteers from an undergraduate introductory measurement
course. A total of 88 students participated, 21 male and 67 female.
There were 19 juniors, 67 seniors, and 2 graduate students. The tailored 
tests were administered shortly after a classroom test over the same con- 
tent. Examinees were told that the tailored test score would be substituted 
for the classroom test score if they performed better on the tailored 
test, and that they would receive extra credit points for completing the 
requirements of the study.

Attitude Survey

In addition to taking the tailored tests, each examinee was asked 
to fill out an attitude survey at the end of each session. The survey 
had 20 items, written in Likert scale format with a five position scale 
of response alternatives. The surveys were scored with a one for the 
response least favorable toward the tailored test and a five for the res- 
ponse most favorable. 



Analyses 

The research questions in this study included a comparison of test- 
retest reliabilities, goodness of fit, content validity, and total test 
information functions. In addition, comparisons were made between ability 
estimates yielded by the 1PL and 3PL procedures, and between the ability
estimates and outside criteria. Attitudes of the students toward tailored 
testing were also determined. Estimated true scores were used in the 
computation of all the correlations, based on the suggestion of Lord (1979). 

The computation of the estimated true scores was accomplished by 
summing the probabilities of correct responses at the examinee's final 
ability estimate for all the items in the item pool. The formula for 
estimated true scores is as follows:

$$t(\theta_j) = \sum_i P_i(\theta_j) \qquad (6)$$

where $t(\theta_j)$ is the estimated true score for Examinee $j$.
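
As an illustration of Equation 6, the sketch below sums the 3PL probabilities of a correct response over a small pool at a given final ability estimate. The function names and the two example items are hypothetical; for the 1PL procedure the same sum would use the 1PL response function instead.

```python
import math

D = 1.7  # scaling constant from the 3PL model


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def estimated_true_score(theta, pool):
    """Sum of correct-response probabilities over every item in the pool.

    pool: list of (a, b, c) item parameter estimates.
    """
    return sum(p_3pl(theta, a, b, c) for a, b, c in pool)


# Hypothetical two-item pool evaluated at theta = 0.
pool = [(1.0, 0.0, 0.2), (1.0, 0.0, 0.0)]
print(estimated_true_score(0.0, pool))  # approximately 0.6 + 0.5
```

Because the sum runs over the whole pool rather than only the items administered, examinees who took different tailored tests are placed on a common number-correct metric.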

The reliabilities computed for this study were not strictly test- 
retest reliabilities, but rather a mixture of test-retest and equivalent 
forms reliabilities since the tests in one session were not identical 
to tests taken in the other session. The reliabilities were compared 
using a t-test based on Fisher's r to z transformation.

The total test information analyses were done to compare the rela- 
tive efficiencies (Birnbaum, 1968) of the tailored testing procedures 
with respect to the course exam. The relative efficiency was the ratio 
of the information provided by the tailored test at a particular ability 
to the information of the traditional paper-and-pencil course exam at 
the same ability. Plots were drawn of the relative efficiency curves 
for the two tailored testing models based on sample cases selected from 
across the entire range of the tailored testing ability estimates. 

Other analyses run on the data included a series of correlational
analyses. For instance, correlations between the 1PL and 3PL ability
estimates were run using estimated true scores, as were correlations between
the ability estimates and course exam scores. The exams that were correlated
with estimated true scores included the course exam over the same
content area as the tailored tests as well as two other course exams and
the sum of all the course exams. The objective of all these correlational
analyses was to see whether the 1PL and 3PL tests measured the same thing,
and whether one test correlated more highly with the outside criteria.
The correlations of the tailored test scores and the outside criteria
were an indication of concurrent validity. In addition to the above
analyses, descriptive statistics were compiled, including the average
test lengths, the average test difficulties, and the number of items used
from each item pool, for both sessions of the 1PL and 3PL tests.

The goodness of fit statistic used in this study was the mean square
deviation, calculated for each examinee by summing over items the squared
differences between the actual responses to the items and the expected
responses to the items (probability of a correct response) as predicted
by the models, and dividing by the number of items administered.
The formula for the MSD statistic is

$$MSD_j = \frac{1}{n_j}\sum_{i=1}^{n_j} \left(u_{ij} - P_i(\theta_j)\right)^2 \qquad (7)$$

where $MSD_j$ is the mean squared deviation for Examinee $j$, $u_{ij}$ is the actual
response to Item $i$ by Examinee $j$, $P_i(\theta_j)$ is the probability of a correct
response to Item $i$ by Examinee $j$ determined from the model using the final
ability estimate and the estimated item parameters, and $n_j$ is the number
of items in the tailored test for Examinee $j$. The MSD statistic was computed
for a systematic sample of 29 examinees from across the ability
range. The 1PL and 3PL tests were compared using the MSD statistic as
the dependent variable in a dependent t-test.
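
A minimal sketch of the MSD computation for one examinee is given below. The function name and the example responses and probabilities are hypothetical; in practice the probabilities would come from the fitted model at the examinee's final ability estimate.

```python
def mean_square_deviation(responses, probabilities):
    """Mean squared difference between 0/1 responses and model probabilities.

    responses: actual item scores for one examinee.
    probabilities: model-predicted probabilities of a correct response
    for the same items, in the same order.
    """
    if len(responses) != len(probabilities) or not responses:
        raise ValueError("need one probability per response")
    squared = [(u - p) ** 2 for u, p in zip(responses, probabilities)]
    return sum(squared) / len(squared)


# Hypothetical two-item test: one correct and one incorrect response.
print(mean_square_deviation([1, 0], [0.8, 0.3]))  # approximately 0.065
```

Smaller values indicate that the model's predicted probabilities lie closer to the observed responses.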

Content validity analyses were done to determine the degree to which 
the item pools and the tailored tests accurately represented the content 
breakdown of the traditional test. Actual and expected frequencies of
content samplings were compared using a chi-square statistic. Since the argument
was presented that achievement tests are typically multidimensional, factor
analyses were also run on the course exam to determine the factor
structure of the test. Principal components analyses with varimax rotations
were employed.

Student attitudes were analyzed using data from the surveys admin- 
istered at the end of each session. The first analysis run on the response 
data was a principal components factor analysis followed by a varimax 
rotation. Once the factor structure was determined, attempts were made 
to label factors and compare them with the factors from previous administrations
of the scale reported by Koch and Reckase (1978, 1979). Coefficient
alpha reliabilities were calculated for each factor as well as for
the total scale. Response frequencies for the five scale positions were
tabulated for both sessions to summarize student attitudes toward tailored
testing. Also, multivariate analyses were run to determine whether there was
a significant change in attitudes from one session to the next.

Results



Reliability 

Table 2 contains the correlation matrix obtained from intercorrela- 
ting the ability estimates yielded by the two models used in the tailored 
testing sessions. The correlation of r = .57 between the ability estimates
from the first 1PL test (1PL 1) and the ability estimates from the
second 1PL test (1PL 2) was the reliability for the 1PL procedure. The
reliability for the 3PL procedure, r = .62, was higher, but not significantly
so. The KR-20 reliability of the traditional paper-and-pencil
course exam was r = .60. The reliabilities of the tailored tests were 
actually substantially higher than the reliability of the conventional 
test, since normally a KR-20 reliability would be expected to be higher 
than a test-retest reliability. Also, it should be noted that the tailored 
tests were less than half as long as the conventional test. 
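
As an illustration of the KR-20 coefficient mentioned above, the following 
is a minimal sketch that computes it from a matrix of dichotomous (0/1) 
item scores. The function name kr20 and the toy data are hypothetical, and 
the sketch assumes the population (divide-by-N) variance of total scores; 
it is not the program used in the study. 

```python
def kr20(responses):
    """KR-20 reliability for rows of 0/1 item scores (one row per examinee)."""
    n = len(responses)          # number of examinees
    k = len(responses[0])       # number of items
    # Proportion of examinees answering each item correctly.
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of the total (number-correct) scores.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical data: four examinees, three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(scores), 2))
```

The test-retest reliabilities reported for the tailored tests are, by 
contrast, ordinary correlations between the two sessions' ability estimates. 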

Table 2 

Ability Estimate Correlations 

Model     Session       1       2       3       4 

1. 1PL       1        1.00     .57     .35     .42 
2. 1PL       2                1.00     .38     .44 
3. 3PL       1                        1.00     .62 
4. 3PL       2                                1.00 


Table 3 shows that the tailored test reliabilities were even higher 
when estimated true scores were used in place of ability estimates. Using 
estimated true scores, the 1PL reliability was r = .62 and the 3PL 
reliability was r = .71. 

Table 3 

Ability Estimate Correlations Using Estimated True Scores 

Model     Session       1       2       3       4 

1. 1PL       1        1.00     .62     .36     .44 
2. 1PL       2                1.00     .41     .52 
3. 3PL       1                        1.00     .71 
4. 3PL       2                                1.00 


Information 

The relative efficiency comparison of the total test information 
for the 1PL and 3PL procedures is shown in Figure 4. The horizontal 
broken line represents the relative efficiency of the course exam, which 
was used as a standard for comparing the two procedures. It should be noted 
that the ability scale for the 1PL model is not the same as the ability 
scale for the 3PL model. Thus the plots are not comparable on a point by 
point basis. However, an overall visual examination of the plots of 
information curves for the two models is still possible. 



Figure 4. Total test information: comparison of 1PL and 3PL tailored tests. 
Perhaps the most significant result of this comparison is that the 
3PL procedure not only yielded more information than the 1PL procedure, 
but in the ability estimate range of -2.0 to +1.5 the 3PL procedure also 
yielded more information than the 50 item paper-and-pencil test. It is 
important to point out that the 3PL procedure performed best in that range 
of ability estimates where most of the examinees were classified, while 
the 1PL procedure had its highest relative efficiency at the upper end 
of the range of ability estimates, where few examinees were classified. 
See Appendix B for the distribution of ability estimates for both the 
1PL and 3PL procedures. 



Goodness of fit 

Table 4 presents the results of the goodness of fit comparison of 
the 1PL and 3PL models using the MSD statistic. MSD values were computed 
for 29 cases for each model, along with means, standard deviations, and 
the results of a dependent t-test analysis of the data. The results of 
the t-test indicated that the 3PL model fit the observed responses 
significantly better than the 1PL model (p < .001). 
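
The dependent t-test on paired MSD values can be sketched as follows. This 
is a generic paired-samples computation, not the authors' program; the 
function name dependent_t is hypothetical, and the five pairs in the usage 
lines are simply the first five observations listed in Table 4, so the 
resulting statistic is not expected to match the value reported for all 
29 cases. 

```python
import math


def dependent_t(x, y):
    """Paired-samples t statistic and degrees of freedom for x versus y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample (n - 1) variance of the paired differences.
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# First five MSD pairs from Table 4 (1PL, 3PL), for illustration only.
msd_1pl = [0.1887, 0.1833, 0.1863, 0.2085, 0.2123]
msd_3pl = [0.1103, 0.0142, 0.0832, 0.1894, 0.1226]
t_stat, df = dependent_t(msd_1pl, msd_3pl)
print(round(t_stat, 3), df)
```

A positive t here means the 1PL MSD values are larger on average, that is, 
the 3PL model fits the observed responses more closely. 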



Correlational Analyses 

Tables 5 and 6 show the correlations of the traditional course exam 
scores and total course scores (the sum of the course exam scores) with 
the tailored testing ability estimates and with the estimated true scores, 
respectively. The differences between the correlations of the 1PL and 
3PL ability estimates with Exam I were not significant, while the 1PL 
correlation was significantly higher than the 3PL correlation with 
respect to the total score for the first session (p < .05) but not for the 
second. The correlations did not change significantly when estimated 
true scores were used instead of ability estimates. 

One interesting result that is shown in Table 5 is that the 1PL 1 
ability estimates correlated significantly higher with Exam II than with 
Exam I (p < .05). Moreover, both the 1PL 1 and the 1PL 2 ability 
estimates correlated higher with the total course score than with Exam I 
(p < .01 for 1PL 1, p < .05 for 1PL 2). Remember that Exam I was the 
course exam over the same material as the tailored tests. One possible 
explanation for this is that the KR-20 reliabilities of Exam II and the 
total course score were higher than the reliability of Exam I. The 
reliability of the total course score was computed according to a method 
suggested by Lord and Novick (1968, pp. 203-204). These reliabilities are 
shown in Table 5. 



Descriptive Statistics 



Table 7 presents descriptive statistics for both sessions of the 
1PL and the 3PL tailored tests. The mean number of items administered 
indicates that the 1PL tests tended to be longer than the 3PL tests, and 
that many of the 1PL tests went the maximum of 20 items. The mean 
proportion of items answered correctly shows that the 3PL procedure 
administered items that were, overall, easier than those items administered 
by the 1PL procedure. 

Table 4 

Goodness of Fit Comparison 
Using the MSD Statistic 

Observations    One-Parameter MSD    Three-Parameter MSD 

     1               .1887                 .1103 
     2               .1833                 .0142 
     3               .1863                 .0832 
     4               .2085                 .1894 
     5               .2123                 .1226 
     6               .2087                 .1394 
     7               .1853                 .0349 
     8               .2107                 .1137 
     9               .2133                 .2273 
    10               .2174                 .1216 
    11               .1923                 .2405 
    12               .2219                 .2515 
    13               .2120                 .1826 
    14               .2197                 .2171 
    15               .2192                 .0728 
    16               .2033                 .1712 
    17              [illeg.]              [illeg.] 
    18               .2124                 .2024 
    19               .2122                 .2305 
    20              [illeg.]               .1010 
    21               .2095                 .0457 
    22               .1883                 .1309 
    23               .2230                 .2107 
    24               .1367                 .0235 
    25               .2086                 .1751 
    26               .2177                 .2281 
    27               .2087                 .1330 
    28               .2137                 .0994 
    29               .2097                 .1693 

   Mean              .2049                 .1483 
   S.D.              .0425                 .0740 

t(28) = 5.082 (p < .001) 

Table 5 

Correlations of Ability Estimates with Traditional Course Exams 

                                  Tailored Testing Model and Session 
Traditional       KR-20 
Course Exam     Reliability     1PL 1     1PL 2     3PL 1     3PL 2 

Exam I*             .60          .42       .49       .39       .42 
Exam II             .76          .58       .46       .36       .47 
Exam III            .64          .36       .35       .38       .44 
Total Score         .75          .68       .63       .45       .52 

*Exam I was over the same content area as the tailored tests. 


Table 6 

Correlations of Estimated True Scores 
with Traditional Course Exams 

                   Tailored Testing Model and Session 
Traditional 
Course Exam       1PL 1     1PL 2     3PL 1     3PL 2 

Exam I*            .42       .49       .40       .42 
Exam II            .58       .46       .36       .44 
Exam III           .37       .33       .40       .44 
Total Score        .68       .62       .46       .51 

*Exam I was over the same content area as the tailored tests. 


Table 7 

Tailored Test Descriptive Statistics 

                                        One-Parameter            Three-Parameter 
                                        Tailored Test            Tailored Test 
Variable                             Session 1   Session 2    Session 1   Session 2 

Mean # of items administered           19.09       18.11        16.23       15.32 
Mean # of items correct                11.07       10.30        12.15       11.71 
Mean proportion of items correct         .58         .57          .75         .76 
Mean of ability estimates               1.37        1.50         -.53        -.36 
S.D. of ability estimates                .67         .92          .74         .83 


An important effect related to the 3PL item discrimination parameter 
estimates was that only 25 items from the 183 items in the 3PL item pool 
were used by the 3PL testing procedure. On the other hand, the 1PL 
procedure used 120 items from the 183 items in the 1PL item pool. Figure 
5 shows the information curve for the 25 items that were used from the 
3PL item pool. The plot shows that the most information yielded by this 
reduced pool was at the lower range of abilities. In fact, for ability 
estimates over +2.0 there were virtually no items available with 
information above the information cutoff. 



Figure 5. Information curve for the 25 items used from the 3PL item pool. 
Content Validity 

Table 8 shows the results of the content validity analysis for both 
tailored testing models. The chi-square test indicated that both the 1PL 
and the 3PL item pools accurately reflected the weighting of the content 
areas specified in the table of specifications for the paper-and-pencil 
course exam (see Table A-2 in Appendix A). However, the number of items 
administered by content area for a systematic sample of 21 1PL tailored 
tests and 20 3PL tailored tests showed significant lack of fit to both 
the item pools and the course exam. Also, the content distributions of 
the 1PL and 3PL tailored test items were significantly different. It 
should be noted that no attempt was made in the tailored testing 
procedures to branch among the content areas. The object was to see if 
selecting items for administration on the basis of information alone would 
approximate the content area weightings of the item pools and the course 
exam. 
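
A chi-square comparison of observed content-area counts against a reference 
weighting can be sketched as below. The report does not state exactly how 
the expected frequencies were formed, so this sketch assumes they are the 
observed total redistributed in proportion to the reference counts; the 
function name chi_square_fit and the three-category counts are hypothetical. 

```python
def chi_square_fit(observed, reference):
    """Chi-square statistic and df for observed counts vs. reference weights."""
    if len(observed) != len(reference):
        raise ValueError("observed and reference need the same categories")
    total_obs = sum(observed)
    total_ref = sum(reference)
    # Expected count per category if the observed items followed the
    # reference proportions exactly.
    expected = [total_obs * r / total_ref for r in reference]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical counts: items administered per content area vs. a
# reference test that weights three areas equally.
stat, df = chi_square_fit([10, 20, 30], [1, 1, 1])
print(stat, df)
```

With the 11 content areas of Table 8 the same computation would have 10 
degrees of freedom, and the statistic would be compared with the critical 
chi-square value at the chosen significance level. 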



Attitude Survey 

Attitude Scale Characteristics. Table 9 shows the varimax rotated 
factor loading matrix obtained from a principal components analysis of 
the first administration of the attitude scale. There were six factors 
present with eigenvalues greater than one, accounting for 62.5 percent 
of the variance. A subjective examination of the items loading on each 
factor resulted in the following factor labels: 

factor I - cathode ray tube (CRT) characteristics 

factor II - perceived test performance/ test satisfaction 

factor III - motivation 

factor IV - anxiety 

factor V - test pace 

factor VI - time pressure/item easiness 
The items appearing on the attitude scale are listed in Appendix C. 

Table 10 shows the rotated factor loading matrix obtained from the 
analysis of the second administration of the attitude scale. This time 
there were five factors present with eigenvalues greater than one, account- 
ing for 62 percent of the variance. After a subjective examination, these 
factors were given the following labels: 

factor I - perceived test performance/test satisfaction 

factor II - motivation 

factor III - anxiety/time pressure 

factor IV - miscellaneous 

factor V - CRT characteristics/item easiness 

Factor analysis results obtained from the two attitude scale 
administrations differed somewhat. For instance, in the first administration 
of the scale, anxiety and time pressure items loaded on separate factors, 
while in the second administration they formed a single factor. Another 
difference was that in the first administration, item easiness items 
loaded with time pressure items, while in the second administration item 
easiness items loaded with CRT characteristics items. Also, in the first 
administration Item 11 loaded by itself, while in the second administration 
it was joined by three other items in a factor of assorted item types, 
labelled here as miscellaneous. 

A multivariate analysis of variance (MANOVA) was performed to deter- 
mine whether the mean scores on each item were different over the two 
administrations of the attitude scale. The results of the MANOVA indica- 
ted that there were no significant changes. This implied that, regardless 
of the changes in factor structure, student attitudes toward tailored 
testing did not change from one administration to the next. 
Table 8 

Test Items by Content Area for the Course Exam, 
Item Pools, and Tailored Tests 

                                                                 Items in 21       Items in 20 
                    Course Exam      Items in       Items in     1PL Tailored      3PL Tailored 
                       Items         1PL Pool       3PL Pool        Tests             Tests 

                    Number    %    Number    %    Number    %    Number     %      Number     % 

Anecdotal Records      5    10.0     18     9.8     18     9.8     62     15.3        0      0 
Behavioral             5    10.0     20    10.9     20    10.9     43     10.6       47    15.5 
Checklists             5    10.0     18     9.8     18     9.8     37      9.1       21     6.9 
Peer Appraisals        2     4.0      5     2.7      5     2.7      6      1.5       14     4.6 
Planning Tests         3     6.0     12     6.6     12     6.6     --      --        18     5.9 
Rankings               3     6.0      9     4.9      9     4.9      4      1.0       20     6.6 
Ratings                6    12.0     25    13.7     25    13.7     76     18.7       64    21.1 
Selection Items        8    16.0     30    16.4     30    16.4     55     13.6       59    19.5 
Self Report            2     4.0      8     4.4      8     4.4     12      3.0       14     4.6 
Supply Items           5    10.0     20    10.9     20    10.9     32      7.9        9     3.0 
Table of 
Specifications         6    12.0     18     9.8     18     9.8     41     10.1       37    12.2 

Total                 50            183            183            406               303 

Note. Below are the chi-square values for several comparisons. The critical 
value for rejection of adequate fit is χ²(10) = [illegible]. 

1. Course exam items vs. items in 1PL pool, χ² = 4.431 
2. Course exam items vs. 1PL tailored test items, χ² = 55.078 
3. 1PL pool items vs. 1PL tailored test items, χ² = 43.139 
4. Course exam items vs. items in 3PL pool, χ² = 4.431 
5. Course exam items vs. 3PL tailored test items, χ² = 80.878 
6. 3PL pool items vs. 3PL tailored test items, χ² = 77.662 
7. 1PL tailored test items vs. 3PL tailored test items, χ² = 89.02 
Table 9 

Principal Components Analysis 
Varimax Rotated Factor Pattern for 
First Attitude Survey Administration 

                                     Factor 

Item No.       I          II         III        IV         V          VI 

    1        -.06        .24        -.28        .47        .45        .03 
    2         .15        .09         .06      [illeg.]   [illeg.]     .76 
    3        -.23        .68        -.12        .01       -.40      [illeg.] 
    4         .23      [illeg.]      .19        .52       -.10        .43 
    5        -.08        .11         .78      [illeg.]    -.06      [illeg.] 
    6        -.19        .64       [illeg.]     .29        .24        .03 
    7         .22      [illeg.]      .04        .12       -.14        .12 
    8         .71      [illeg.]      .08       -.00       -.09        .05 
    9       [illeg.]    -.10         .26       -.20        .53        .58 
   10         .02        .14        -.13        .72      [illeg.]   [illeg.] 
   11         .31       -.04         .13      [illeg.]    -.60        .19 
   12         .25        .64        -.10       -.03      [illeg.]    -.13 
   13         .74      [illeg.]      .06        .19        .10        .04 
   14       [illeg.]     .65         .15        .16        .22       -.31 
   15         .41      [illeg.]     -.09        .62       -.05        .24 
   16         .72        .17         .13      [illeg.]    -.20        .23 
   17       [illeg.]     .78         .30       -.02       -.05        .27 
   18        -.31       -.05         .43        .49       -.01       -.10 
   19         .18        .04       [illeg.]   [illeg.]    -.16        .02 
   20         .32        .19       [illeg.]    -.09        .16       -.00 

Note. The underlined values indicate the highest loadings of an item 
on a factor. Broken underlines indicate other high loadings. 



A comparison of the results of the attitude scale administrations 
for this study with results from previous administrations of the scale 
indicated several differences. For instance, in the earliest 
administration of the scale (Koch and Reckase, 1978) anxiety and time 
pressure items loaded on separate factors, while in a subsequent study 
(Koch and Reckase, 1979) they formed a single factor. In the present study, 
they loaded on separate factors in the first administration, and on the 
same factor in the second administration. In both of the earlier studies 
perceived test performance and test satisfaction items loaded on separate 
factors, while in the present study they formed a single factor in both 
administrations. 

Table 10 

Principal Components Analysis 
Varimax Rotated Factor Pattern for 
Second Attitude Survey Administration 

                                Factor 

Item No.       I          II         III        IV         V 

    1       [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.] 
    2       [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.] 
    3       [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.] 
    4        -.17        .39         .64      [illeg.]     .25 
    5         .25        .77       [illeg.]    -.08        .08 
    6         .58      [illeg.]      .15        .43       -.25 
    7       [illeg.]     .25         .07      [illeg.]    -.03 
    8       [illeg.]     .15         .18        .13        .79 
    9        -.03        .34        -.12       -.26      [illeg.] 
   10         .23       -.27         .75        .27      [illeg.] 
   11         .26       -.05       [illeg.]    -.65        .24 
   12         .65        .03        -.10      [illeg.]     .40 
   13       [illeg.]     .00         .72        .07        .45 
   14         .51        .02       [illeg.]     .69       -.09 
   15         .36       -.39         .58      [illeg.]     .10 
   16         .27        .00       [illeg.]    -.14        .64 
   17         .83        .20         .17        .11        .03 
   18       [illeg.]     .59        -.03        .17       -.12 
   19         .22        .50         .12      [illeg.]     .24 
   20        -.04        .64         .16        .01        .24 

Note. The underlined values indicate the highest loading of an item 
on a factor. Broken underlines indicate other high loadings. 


Two types of reliability measures were computed for the attitude 
scale. First, a test-retest reliability coefficient was computed between 
the sets of total attitude scores for the two administrations. A value 
of r = .71 was obtained for this reliability measure. The second type 
of reliability measure calculated for the attitude scale was a coefficient 
alpha reliability. Coefficient alpha reliabilities were computed for each 
factor and for the total scale for both administrations of the 
instrument. The results are shown in Table 11 for the first administration 
and in Table 12 for the second administration. Overall these 
reliabilities were fairly high. However, for the first administration, the 
reliability of the time pressure/item easiness factor was relatively 
low. Note that in the second administration these two item types did 
not load together. In the second administration the only factor not 
having a high reliability coefficient was the miscellaneous factor. 
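
Coefficient alpha for a set of items can be sketched as follows. This is 
the standard formula applied to hypothetical five-point responses; the 
function name coefficient_alpha and the data are illustrative assumptions, 
not the study's responses. 

```python
from statistics import variance


def coefficient_alpha(scores):
    """Coefficient alpha for rows of item scores (one row per respondent)."""
    k = len(scores[0])  # number of items in the scale or factor
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five students to a three-item factor,
# scored 1 (strongly disagree) to 5 (strongly agree).
factor_scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 4, 5],
    [3, 3, 2],
    [1, 2, 2],
]
alpha = coefficient_alpha(factor_scores)
print(round(alpha, 2))
```

A factor defined by a single item, such as Item 11 in the first 
administration, has no between-item covariance to draw on, which is why 
no alpha can be computed for it alone. 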

Item discrimination indices were calculated for the items on the 
attitude survey by correlating individual item scores with the total 
scores for each examinee. These values are shown in Table 13. Discrim- 
inations were relatively constant across the two administrations, with 
the exception of Item 10. 
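
The item discrimination index described above, the correlation of each 
item score with the total score, can be sketched as below. The helper 
names pearson and item_total_correlations and the data are hypothetical; 
the total here includes the item itself, which is how the text describes 
the computation. 

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(scores):
    """Correlate each item column with the row totals (item included)."""
    totals = [sum(row) for row in scores]
    k = len(scores[0])
    return [pearson([row[j] for row in scores], totals) for j in range(k)]


# Hypothetical responses: three respondents, two items.
discriminations = item_total_correlations([[1, 1], [2, 1], [3, 2]])
print([round(r, 2) for r in discriminations])
```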

Attitude Scale Results. Responses obtained from the administration 
of the attitude scale are summarized in Table 14. Response percentages 
for the five categories for each item are shown for both administrations. 

Table 11 

Coefficient Alpha Reliabilities for Attitude 
Survey Factors and Total Scale for Session I 

Factor Labels                                   Items                  Coeff. α 

I.   CRT Characteristics                        8, 13, 16                .69 
II.  Perceived Test Performance/ 
     Test Satisfaction                          3, 6, 7, 12, 14, 17      .79 
III. Motivation                                 5, 19, 20                .66 
IV.  Anxiety                                    1, 4, 10, 15, 18         .52 
V.   Time Pressure/Item Easiness                2, 9                     .28 
     Total Scale                                all 20 items             .75 

Note. Item 11 loaded on its own factor, so no coefficient α could be 
calculated for it alone. 


Table 12 

Coefficient Alpha Reliabilities for Attitude 
Survey Factors and Total Scale for Session II 

Factor Labels                                   Items                  Coeff. α 

I.   Perceived Test Performance/ 
     Test Satisfaction                          6, 7, 12, 17             .77 
II.  Motivation                                 5, 18, 19, 20            .66 
III. Anxiety/Time Pressure                      2, 4, 10, 13, 15         .74 
IV.  Miscellaneous                              1, 3, 11, 14             .22 
V.   CRT Characteristics/Item Easiness          8, 9, 16                 .55 
     Total Scale                                all 20 items             .77 



Overall the results of the attitude survey were positive regarding 
attitudes toward the tailored testing situation. Examinees indicated 
that they felt less time pressure when taking the tailored test than 
when taking the conventional test. However, responses indicated a split 
over whether the examinees felt that they did well on the tailored test, 
and many examinees remained neutral on those items dealing with test 
performance. Examinees indicated that they were motivated to do well on 
the test, but felt little anxiety or stress. The examinees responded 
that they felt comfortable with the CRTs, and that the screens were not 
difficult to read. Test items were apparently perceived as neither too 
difficult nor too easy, but examinees were split over whether they believed 
the tailored tests reflected their true knowledge of the material. No 
significant correlations were found between the attitude scores and the 
ability estimates. 
Table 13 

Discrimination Indices for Attitude Scale 
Items for Two Test Sessions 

Item No.     Session I     Session II 

    1           .26           .28 
    2           .41           .41 
    3           .29           .43 
    4           .45           .52 
    5           .41           .36 
    6           .48           .35 
    7           .59           .57 
    8           .46           .45 
    9           .18           .18 
   10           .28           .52 
   11           .31           .28 
   12           .41           .51 
   13           .54           .65 
   14           .44           .51 
   15           .59           .46 
   16           .56           .58 
   17           .72           .65 
   18           .22           .28 
   19           .33           .46 
   20           .46           .37 



Discussion 



In order to fully understand the results of the research reported 
here, the results of three tailored testing studies should be kept in 
mind: (a) the application of tailored testing models to a vocabulary 
test (Koch and Reckase, 1978), (b) a previous attempt to apply tailored 
testing models to achievement testing (Koch and Reckase, 1979), and (c) 
the current study. The first study, using the vocabulary test, was 
successful, but the success was not surprising, since the vocabulary test 
used was highly unidimensional. However, nonconvergence of the ability 
estimates was found to be a problem. The high nonconvergence rate was felt 
to be due to the inappropriate difficulty of the item pool. When an 
attempt was made to apply tailored testing to a multidimensional 
achievement test, the nonconvergence problem was reduced through the use of 
items of appropriate difficulty, but other problems were encountered (e.g., 
low reliabilities), and the attempt at application was unsuccessful. 

There were indications that the lack of success in this first 
achievement testing study might have been due to factors other than the 
multidimensional nature of the test, such as the linking procedures used 
with the calibrations. The current study, in which improvements were made 
in the operational characteristics of the tailored testing procedures 
and the linking procedures, demonstrated that tailored testing could be 
successfully applied to a multidimensional test, if reliability and 
information functions were used as criteria. Indeed, the current study 
employed virtually the same item pool as the first tailored achievement 
testing study, but the results were quite different. The difference between 
these two achievement studies was not in the dimensionality of the item 
pool, but in the operational characteristics of the procedures employed. 
The changes that were made and their effects will now be discussed. 


Table 14 

Attitude Scale Response Percentages for 
Item Alternatives over Both Sessions 

                          Session 1                                  Session 2 
Item No.      SA         A        N     D    SD            SA     A     N     D    SD 

    1          0      [illeg.]   16    27     8             5    20    17    39    19 
    2       [illeg.]  [illeg.]    9    10     3            20    49    15    15     1 
    3       [illeg.]  [illeg.]   34     6     2             7    60    25     8     0 
    4          1      [illeg.]    6    47    38             1     6     7    49    38 
    5       [illeg.]  [illeg.]   17     8     0            18    53    15    11     2 
    6       [illeg.]  [illeg.]   63    10     0             0    39    55     7     0 
    7          5         27      26    39     3             1    39    30    31     0 
    8         13         23      13    35    17             6    24    10    42    18 
    9          6         72      23     0     0             6    73    22     0     0 
   10          0         14       8    44    34             3     6     6    47    39 
   11         17         53       8    18     3            13    51    10    19     7 
   12          8         48      24    20     0             2    40    31    26     1 
   13          3          5       3    58    31             1     6     6    52    35 
   14          0         15      48    38     0             1    17    50    32     0 
   15         38         43      11     7     1            31    58     3     6     2 
   16         25         55       9    10     1            23    55    11    10     1 
   17          1         38      35    22     5             0    28    40    27     5 
   18          1         38      19    31    11             2    35    22    34     7 
   19          0          0       5    60    35             0     2     5    66    27 
   20          1          1       5    51    42             0     2     9    56    33 

Note. SA = Strongly Agree, A = Agree, N = Neutral, D = Disagree, SD = 
Strongly Disagree. For a list of the actual items, see Appendix C. 



Reliability 

A number of changes implemented during the design of the current 
study probably contributed to the gain in the tailored test reliabilities 
over the previous tailored testing achievement study. One such change 
was the improvement of the linking procedures that were employed. The 
1PL item parameter estimates were linked using the same method as was 
used in the previous studies. However, previously the linking had been 
done by hand, while this time computer programs were used to perform the 
linking. Therefore, any computational errors that might have occurred 
in linking should have been eliminated. For this study the 3PL 
calibrations were linked using the Maximum Likelihood Method, rather than 
the Least Squares Method that had been used earlier. Again linking was 
performed by computer programs instead of by hand. These improvements in 
linking provided more accurate item parameter estimates for the items 
in the pools. 

Another important change was that larger sample sizes were used 
for item calibration. Sample sizes used ranged from 148 to 314, with 
a mean sample size of 226.5. These were not much larger than the sample 
sizes used in previous studies for the 1PL calibrations, but they were 
somewhat larger than the sample sizes used previously for the 3PL 
calibrations. In the previous tailored achievement testing study the 1PL 
sample sizes ranged from 96 to 314, with a mean of 212.82, while 3PL 
sample sizes ranged from 97 to 314, with a mean sample size of 195.4. 
The larger sample sizes may have yielded more stable parameter estimates 
than the previous smaller sample sizes, although Reckase (1977) found 
that these sample sizes were still inadequate for the 3PL calibration. 

Other important changes were in the procedures used in administering 
the tailored tests. For instance, entry points (initial ability 
estimates) for the 3PL procedures were set at the difficulty values on 
either side of the median of the item pool difficulty distribution. In 
earlier studies the entry points were arbitrarily set to be ±.5, because 
the item pool was assumed to be centered around zero. This was found 
not to be the case. By using entry points near the median of the 
difficulty distribution, more items were available within the fixed 
stepsize in either direction. Also, the fixed stepsize that was used was 
.4, rather than the .693 that had previously been used for the 3PL 
procedure. This helped to avoid the previously encountered problem of moving 
through the item pool too quickly, resulting in premature termination 
of the test. These changes in the entry points and fixed stepsize for 
the 3PL procedure were important factors in the virtual elimination of 
the problem of nonconvergence and, together with the improved 
calibrations and linkings, probably accounted for the higher reliabilities 
of the tailored tests. 
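
The entry-point and fixed-stepsize choices can be sketched as follows. 
This section does not spell out the update rule, so the sketch assumes the 
common arrangement in which the ability estimate moves up by the stepsize 
after a correct response and down after an incorrect one until a maximum 
likelihood estimate can be computed. The function names and the pool 
difficulties are hypothetical. 

```python
def entry_points(difficulties):
    """Return the two difficulty values on either side of the pool median."""
    b = sorted(difficulties)
    mid = len(b) // 2
    if len(b) % 2 == 0:
        # Even pool size: the median falls between the two middle values.
        return b[mid - 1], b[mid]
    # Odd pool size: take the neighbours of the middle (median) value.
    return b[mid - 1], b[mid + 1]


def fixed_step(theta, correct, stepsize=0.4):
    """Assumed rule: move the estimate up if correct, down if incorrect."""
    return theta + stepsize if correct else theta - stepsize


# Hypothetical pool that is not centered on zero.
pool = [-2.1, -1.4, -0.9, -0.6, -0.2, 0.3, 1.1]
low, high = entry_points(pool)
theta = fixed_step(low, correct=True)
print(low, high, round(theta, 2))
```

A smaller stepsize such as .4 moves the estimate through the pool more 
slowly than .693, which is the behavior the paragraph above credits with 
avoiding premature termination. 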



Information 

In looking at the information yielded by the tailored tests it should 
be remembered that the tailored tests were less than half the length of 
the classroom test. Since total test information was the sum of the 
individual item information, a drop in total information would be expected 
when considering a shorter test. Despite this, the 1PL tailored test 
yielded almost as much information as the classroom test, and the 3PL 
tailored test yielded more information than the classroom test over most 
of the ability range. 
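
The additivity of item information can be illustrated with the standard 
logistic-model formulas. In this sketch the symbols a (discrimination), 
b (difficulty), c (lower asymptote), and the scaling constant D are the 
usual latent trait notation rather than names taken from this report, and 
D = 1.0 and the item parameters are assumptions. Setting a = 1 and c = 0 
gives the one-parameter case. 

```python
import math


def p_correct(theta, a=1.0, b=0.0, c=0.0, D=1.0):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a=1.0, b=0.0, c=0.0, D=1.0):
    """Item information at ability theta under the 3PL model."""
    p = p_correct(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items, D=1.0):
    """Total test information: the sum of the item informations."""
    return sum(item_information(theta, D=D, **item) for item in items)


# Hypothetical items; a short test of discriminating items can carry
# as much information as a longer test of weaker items.
short_test = [{"a": 1.6, "b": 0.0, "c": 0.2}] * 2
long_test = [{"a": 0.8, "b": 0.0, "c": 0.2}] * 4
print(round(total_information(0.0, short_test), 3))
print(round(total_information(0.0, long_test), 3))
```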
Goodness of fit 

The superior fit of the 3PL model indicated that the 3PL tailored 
tests demonstrated better 'person' fit than did the 1PL tests. It was 
no surprise that the three-parameter model fit observed response data 
better than the one-parameter model. A model with three parameters has 
more flexibility in fitting data than a model with only one parameter. 
Such a finding is consistent with the findings of previous studies (Koch 
and Reckase, 1978, 1979). 



Correlational Analyses 

In correlating the tailored testing ability estimates with the 
outside criterion variables, it was found that the 1PL 1 ability estimates 
correlated significantly higher with Exam II than with Exam I. Also, 
both the 1PL 1 and 1PL 2 ability estimates correlated significantly higher 
with the total course score than with Exam I. This is somewhat 
surprising, since Exam I was the course exam over the same content as the 
tailored tests. However, this might be explained by examining the 
reliabilities of the course exams. The KR-20 reliabilities of Exam II and 
the total course score were higher than the KR-20 reliability of Exam I. 
The lower reliability of Exam I might be limiting the magnitude of the 
correlations that can be obtained using that test. Of course, this would be 
true for correlations of Exam I with both the 1PL and 3PL ability 
estimates. One reason why this effect appeared with the 1PL ability 
estimates and not the 3PL ability estimates might be that, since the 1PL 
calibrations are based on the sum of the factors, the 1PL tests might have 
had factors in common with Exam II. The 3PL calibrations are based on the 
dominant factor, which the 3PL tests would have in common with Exam I but 
not Exam II. Any sharing of factors between the 1PL tests and Exam II would 
have caused that correlation to be higher than the correlation between the 
3PL ability estimates and Exam II. However, these explanations are only 
conjecture, and further studies are needed to determine if these anomalous 
results can be replicated. 
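
The limiting effect of reliability on a correlation can be made concrete 
with the classical test theory bound, under which an observed correlation 
cannot exceed the square root of the product of the two reliabilities. 
The sketch below applies that bound to the reported 1PL test-retest 
reliability (.57) and the KR-20 values for Exam I (.60) and Exam II (.76). 
It is an illustration only: the reliabilities are themselves estimates of 
different kinds, and the function name is hypothetical. 

```python
import math


def max_observed_correlation(rel_x, rel_y):
    """Classical test theory ceiling on the correlation of two measures."""
    return math.sqrt(rel_x * rel_y)


rel_1pl = 0.57  # test-retest reliability of the 1PL ability estimates
for exam, rel_exam in [("Exam I", 0.60), ("Exam II", 0.76)]:
    ceiling = max_observed_correlation(rel_1pl, rel_exam)
    print(exam, round(ceiling, 2))
```

The ceiling is lower for Exam I than for Exam II, which is the direction 
of the effect suggested in the text. 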



Content Validity 

The content validity results clearly indicated that, even though 
the item pools reflected content area weightings proportionate to the 
classroom test, the tailored test item selection procedures did not 
maintain these content weightings. For the 3PL procedure this was not 
surprising. High item discriminations were not distributed evenly across 
content areas and, since the 3PL procedure selected items on information, 
those content areas having no highly discriminating items were not 
represented at all. Content areas with several high discriminators were 
weighted too heavily relative to the table of specifications. This 
imbalance in the distribution of item discriminations was probably caused 
by the loading of the highly discriminating items on the dominant factor. 
Previous research (Reckase, 1977) had indicated that the 3PL model 
calibrates items based on the dominant factor in the test, resulting in low 
discrimination values for items loading on the remaining factors, while the 
1PL procedure calibrates items based on the sum of the factors. Given these 
contrasting tendencies, it is not surprising that the 3PL tailored tests 
used only 25 items out of 183, whereas the 1PL tailored tests used 120 
items out of 183. This effect is reflected in the low correlations between 
the 1PL and 3PL ability estimates shown in Table 2. 

For the 1PL procedure, however, item discriminations were assumed 
to be equal, so the result was somewhat surprising. A possible 
explanation is that content areas are not uniformly distributed across the 
difficulty scale. The results indicated that, if content areas were to be 
weighted appropriately, some type of intercontent area branching scheme 
would have to be employed. An alternative to branching might be to 
administer tailored tests over unidimensional subtests and to report a 
profile of scores. Of course, this alternative carries with it the problem 
of identifying unidimensional subtests, as well as determination of a total 
score when one is desired. 



Attitude Survey 

The attitude scale results were generally favorable toward tailored 
testing. However, there was no evidence to indicate any interaction be- 
tween either student motivation or anxiety levels and student test per- 
formance. These findings were consistent with the findings of the previous 
study, which found no significant correlation between attitudes of the 
students toward the tailored tests and their performance. It should be 
emphasized that these studies were performed using college juniors and 
seniors, most of whom were females, and the results may not generalize 
to other groups. 

The factor structure of the attitude scale appeared to be unstable. 
Not only did a number of items switch factors, but the factors themselves 
changed both in number and in their nature. For instance, a number of 
items that loaded on separate factors in the first administration of the 
scale grouped together in the second administration to form a new factor 
that did not occur in the first administration. The items that loaded 
on this new factor, labelled miscellaneous, were items that did not appear 
to be related at all. One possible reason for the unstable factor struc- 
ture of the scale was the small sample size. For a scale of 20 items, 
88 is not an adequate number of subjects to obtain a stable structure. 
It is interesting to note that when the factor structure of the attitude 
scale was analyzed using the scree technique, the results were ambiguous. 
The plot of eigenvalues by factor is shown in Appendix D. The number of 
factors determined using the eigenvalue-greater-than-one rule probably gave 
as good an indication of the number of factors as that obtained from the 
scree plot. 
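
The two criteria mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the response data are synthetic (two simulated factors, 88 simulated subjects, 20 items) and are not the study's attitude data, and the function names are hypothetical. The eigenvalue-greater-than-one rule counts the eigenvalues of the item correlation matrix that exceed 1.0; the scree technique inspects the same eigenvalues in descending order for the point where they level off.

```python
import numpy as np


def correlation_eigenvalues(responses):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(responses, rowvar=False)  # items are columns
    return np.linalg.eigvalsh(corr)[::-1]        # eigvalsh returns ascending


def eigenvalue_greater_than_one_count(eigs):
    """Number of factors retained by the eigenvalue-greater-than-one rule."""
    return int(np.sum(eigs > 1.0))


# Synthetic data only: 88 simulated subjects, 20 items, two latent factors.
rng = np.random.default_rng(0)
n_subjects, n_items = 88, 20
factors = rng.normal(size=(n_subjects, 2))
loadings = np.zeros((2, n_items))
loadings[0, :10] = 0.7   # first ten items load on factor 1
loadings[1, 10:] = 0.7   # last ten items load on factor 2
noise = rng.normal(scale=0.7, size=(n_subjects, n_items))
responses = factors @ loadings + noise

eigs = correlation_eigenvalues(responses)
n_retained = eigenvalue_greater_than_one_count(eigs)

# A scree plot graphs these values against factor number; the successive
# drops show where the curve flattens.
drops = -np.diff(eigs)
print("eigenvalues:", np.round(eigs, 2))
print("retained by eigenvalue > 1 rule:", n_retained)
print("successive drops:", np.round(drops, 2))
```

With a sample this small, sampling error alone can push some trailing eigenvalues above 1.0, which is one reason the two criteria may disagree or give ambiguous results.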



Summary and Conclusions 



Past studies indicated that there might be serious problems with 
the application of tailored testing to multidimensional achievement testing. 
However, there was some evidence that those findings were the result 
of poor item calibration, linking procedures, and test administration 
procedures. The present study showed that if sufficient attention was 
paid to establishing proper operational characteristics, tailored testing 
could be successfully applied to multidimensional achievement tests 
to the extent that they yielded high reliabilities and information. 

The results of this study indicate that tailored test reliabilities 
for both the 1PL and 3PL procedures were probably higher than the reliability 
of the classroom test. The information yielded by the 1PL test 
was almost as high as the classroom test information, and the 3PL test 
information was higher than either one. The fit of the two models to 
the response data showed that the 3PL model fit the data better than the 
1PL model. Neither procedure, however, had adequate content validity. 
In summary, these results showed that tailored testing is a viable procedure 
for achievement testing, with the exception of content validity, 
and that the 3PL model appears to be the model of choice. 
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APPENDIX A 



Table A-1 

Administration Dates and Sample Sizes 
of Achievement Tests Calibrated for 
Tailored Testing Usage 



Date              Sample Size 

9-76                  177 
2-77, 4-77            314 
9-77, 10-77           202 
2-78, 4-78            309 
9-78, 11-78           209 
2-79                  148 

Note. Dates given in month and year. 



Table A-2 
Table of Specifications for Exam I 



                                                            Analysis, 
                          Knowledge of                      Synthesis, 
                          Terms and       Application       and Evaluation 
Content Areas             Techniques      of Techniques     of Techniques     Totals 

Planning the Test              1                1                 1              3 
Behavioral Objectives          1                2                 2              5 
Table of Specifications        2                2                 2              6 
Anecdotal Records              1                2                 2              5 
Rating Scales                  2                2                 2              6 
Checklists                     1                2                 2              5 
Rankings                       1                1                 1              3 
Peer Appraisals                1                1                                2 
Self Reports                   1                1                                2 
Selection Items                2                3                 3              8 
Supply Items                   1                2                 2              5 

Totals                        14               19                17             50 

APPENDIX B 

Figure B-1. Ability Estimate Frequency Distributions 

Figure B-3. Ability Estimate Frequency Distributions 
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APPENDIX C 

Attitude Survey Administered After Each Tailored Testing Session 

Please circle the response to each statement below which most nearly re- 
flects your feelings or attitude. 

1. During the test I was worried about how well I was doing. 

strongly strongly 
agree agree neutral disagree disagree 

2. I felt less time pressure while taking this computerized test than 
while taking conventional tests. 

strongly strongly 
agree agree neutral disagree disagree 

3. I felt that many of the items were too difficult for me. 

strongly strongly 
disagree disagree neutral agree agree 

4. The computer terminal made me feel that I had to answer the items 
as quickly as possible. 

strongly strongly 
agree agree neutral disagree disagree 

5. I didn't care very much about how well I did on the test. 

strongly strongly 
disagree disagree neutral agree agree 

6. I think I did well on the test compared to other people. 

strongly strongly 
agree agree neutral disagree disagree 

7. I felt that my performance on this test reflected my true knowledge 
of A140. 

strongly strongly 
disagree disagree neutral agree agree 

8. My eyes were uncomfortable when viewing the screen. 

strongly strongly 
agree agree neutral disagree disagree 

9. I felt that most of the items on this test were too easy. 

strongly strongly 
disagree disagree neutral agree agree 

10. I was nervous about coming here to take this test. 

strongly strongly 
agree agree neutral disagree disagree 

11. The pace of the computer was so slow that it made me impatient. 

strongly strongly 
disagree disagree neutral agree agree 

12. I feel that I did as well on this test as on other tests I've taken. 

strongly strongly 
agree agree neutral disagree disagree 

13. The computer terminal made me nervous. 

strongly strongly 
agree agree neutral disagree disagree 

14. I felt confident that I did well on the test. 

strongly strongly 
disagree disagree neutral agree agree 

15. I felt considerable stress while taking the test. 

strongly strongly 
disagree disagree neutral agree agree 

16. It was easy to read the words and questions on the screen. 

strongly strongly 
agree agree neutral disagree disagree 

17. I felt that the test did a good job of measuring my ability in A140. 

strongly strongly 
agree agree neutral disagree disagree 

18. I think I could have done better on the test if I had tried harder. 

strongly strongly 
disagree disagree neutral agree agree 

19. I was careful to try to select the best answer to each question. 

strongly strongly 
disagree disagree neutral agree agree 

20. I tried to finish the test quickly just to receive my extra credit 
points. 

strongly strongly 
agree agree neutral disagree disagree 

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